fix: generate scrambled benchmark data with correct per-file RG ranges#21711

Open
zhuqi-lucas wants to merge 1 commit into apache:main from zhuqi-lucas:fix/overlap-benchmark-data
Conversation

@zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Apr 18, 2026

Which issue does this PR close?

Related to #21580

Rationale for this change

Both the inexact and overlap benchmark data generation had the same fundamental problem: writing a single parquet file with ORDER BY scramble/jitter causes the parquet writer to merge rows from adjacent chunks into the same RG at chunk boundaries. This widens RG ranges to span the full data range (~6M instead of ~100K), making reorder_by_statistics a no-op — there's nothing meaningful to reorder.

For the overlap benchmark specifically, the jitter formula also preserved ascending RG order by min values, so there was nothing to reorder even without the boundary-merging issue.

What changes are included in this PR?

Replace the single-file ORDER BY approach for both data_sort_pushdown_inexact() datasets with a two-step strategy:

  1. Write a single sorted file with small (100K-row) RGs via DataFusion, producing ~61 RGs with narrow, non-overlapping l_orderkey ranges.
  2. Split into per-RG files using pyarrow and rename with a deterministic permutation so alphabetical file order ≠ l_orderkey order.

Each file has a narrow range (~100K) but appears in scrambled order — reorder_by_statistics has real work to do.

Local benchmark results (reorder feature branch vs upstream/main):

| Benchmark | Query                    | Baseline | With reorder | Change |
|-----------|--------------------------|----------|--------------|--------|
| inexact   | Q1 (DESC LIMIT 100)      | 4.99 ms  | 4.27 ms      | -14%   |
| inexact   | Q2 (DESC LIMIT 1000)     | 2.54 ms  | 2.26 ms      | -11%   |
| inexact   | Q3 (SELECT * LIMIT 100)  | 6.10 ms  | 5.92 ms      | ~0%    |
| inexact   | Q4 (SELECT * LIMIT 1000) | 4.66 ms  | 5.03 ms      | ~0%    |
| overlap   | Q1 (DESC LIMIT 100)      | 5.03 ms  | 3.97 ms      | -21%   |
| overlap   | Q2 (DESC LIMIT 1000)     | 2.17 ms  | 2.23 ms      | ~0%    |
| overlap   | Q3 (SELECT * LIMIT 100)  | 6.20 ms  | 6.09 ms      | ~0%    |
| overlap   | Q4 (SELECT * LIMIT 1000) | 5.25 ms  | 4.90 ms      | -7%    |

Are these changes tested?

Benchmark data generation change only. Verified locally by:

  • Inspecting parquet metadata: each file has narrow ranges (~100K) in scrambled order
  • Running benchmark comparisons showing clear improvement with reorder_by_statistics

Are there any user-facing changes?

No. Adds pyarrow as a dependency for generating these specific benchmark datasets (pip install pyarrow).

Contributor

Copilot AI left a comment


Pull request overview

Updates the sort pushdown Inexact overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).

Changes:

  • Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered l_orderkey, producing scrambled + overlapping RGs.
  • Updates benchmark script messaging/comments to reflect the new generation strategy.


@zhuqi-lucas zhuqi-lucas force-pushed the fix/overlap-benchmark-data branch from ec93c3b to 8fb395a Compare April 18, 2026 06:33
@zhuqi-lucas zhuqi-lucas changed the title fix: generate scrambled+overlapping RGs for overlap benchmark fix: generate scrambled benchmark data with correct per-file RG ranges Apr 18, 2026
@zhuqi-lucas
Contributor Author

Addressed both Copilot comments; the ORDER BY + jitter/scramble approach was fundamentally flawed:

  1. chunk_id overflow: moot now, no more chunk_id computation in SQL.
  2. Comment wording about row-count chunks: also moot.

The root cause was that the parquet writer merges rows from adjacent chunks into the same RG at chunk boundaries, widening RG ranges to ~3M (instead of ~100K). This made reorder_by_statistics ineffective.

New approach: write a single sorted file with 100K-row RGs, then use pyarrow to split into per-RG files with scrambled names. Each file has a narrow range, scrambled order → reorder actually works. Local benchmark shows ~21% improvement on Q1 with the corrected data.

Both the inexact and overlap benchmark data generation had the same
fundamental problem: writing a single parquet file with ORDER BY
scramble/jitter causes the parquet writer to merge rows from adjacent
chunks into the same RG at chunk boundaries, widening RG ranges to
span the full data range (~6M instead of ~100K). This makes
reorder_by_statistics a no-op — there's nothing meaningful to reorder.

Fix both by using a two-step approach:
1. Write a single sorted file with small (100K-row) RGs
2. Use pyarrow to split into per-RG files with scrambled names

Each file has a narrow l_orderkey range (~100K) but appears in
scrambled alphabetical order, so reorder_by_statistics has real work
to do.

Local benchmark results (feature branch vs upstream/main):

inexact:
  Q1 (DESC LIMIT 100):    4.99ms -> 4.27ms (-14%)
  Q2 (DESC LIMIT 1000):   2.54ms -> 2.26ms (-11%)

overlap:
  Q1 (DESC LIMIT 100):    5.03ms -> 3.97ms (-21%)
  Q4 (SELECT * LIMIT 1000): 5.25ms -> 4.90ms (-7%)
@zhuqi-lucas zhuqi-lucas force-pushed the fix/overlap-benchmark-data branch from 8fb395a to 6aa52df Compare April 18, 2026 06:55
